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Abstract 

Independent component analysis (ICA) is a statistical method for transforming an observable multidimensional random vector 
into components that are as statistically Independent as possible from each other. Usually the ICA framework assumes a model 
according to which the observations are generated (such as a linear transformation with additive noise). ICA over finite fields is 
a special case of ICA in which both the observations and the independent components are over a finite alphabet. In this work 
we consider a generalization of this framework in which an observation vector is decomposed to its independent components 
(as much as possible) with no prior assumption on the way it was generated. This generalization is also known as Barlow’s 
minimal redundancy representation problem and is considered an open problem. We propose several theorems and show that 
this NP hard problem can be accurately solved with a branch and bound search tree algorithm, or tightly approximated with a 
series of linear problems. Our contribution provides the first efficient and constructive set of solutions to Barlow’s problem. The 
minimal redundancy representation (also known as factorial code) has many applications, mainly in the fields of Neural Networks 
and Deep Learning. The Binary ICA (BICA) is also shown to have applications in several domains including medical diagnosis, 
multi-cluster assignment, network tomography and internet resource management. In this work we show this formulation further 
applies to multiple disciplines in source coding such as predictive coding, distributed source coding and coding of large alphabet 
sources. 


Index Terms 

Independent Component Analysis, BICA, ICA over Galois Field, Blind Source Separation, Minimal Redundancy Represen¬ 
tation, Minimum Entropy Codes, factorial Codes, Predictive Coding, Distributed Source Coding, Neural Networks. 


I. Introduction 

I NDEPENDENT Component Analysis (ICA) addresses the recovery of unobserved statistically independent source signals 
from their observed mixtures, without full prior knowledge of the mixing function or the statistics of the source signals. 
The classical Independent Components Analysis framework usually assumes linear combinations of the independent sources 
over the field of real valued numbers K 0 - A special variant of the ICA problem is when the sources, the mixing model and 
the observed signals are over a finite field. 

Several types of generative mixing models can be assumed when working over GE(P), such as modulu additive operations, 
OR operations (over the binary field) and others. Existing solutions to ICA mainly differ in their assumptions of the generative 
mixing model, the prior distribution of the mixing matrix (if such exists) and the noise model. The common assumption to 
these solutions is that there exist statistically independent source signals which are mixed according to some known generative 
model. 

In this work we drop this assumption and consider a generalized approach which is applied to a random vector and 
decomposes it into independent components (as much as possible) with no prior assumption on the way it was generated. This 
problem was first introduced by Barlow Q and is considered a long-standing open problem. 

The rest of this manuscript is organized as follows: In Section II we review the work that was previously done in finite 
fields ICA and factorial codes. In Section III we present the generalized binary ICA problem, propose several theorems and 
introduce three different algorithms for it. In Section IV we extend our discussion to any finite alphabet, focusing on the 
increased computational load it confers and suggest methods of dealing with it. We then dedicate Section V to a constrained 
version of our problem, aimed to further reduce the computational complexity of our suggested solutions. In Section VI we 
present several applications for our suggested framework, mainly focusing on the field of source coding. 

II. Previous Work 

In his work from 1989, Barlow Q presented a minimally redundant representation scheme for binary data. He claimed that 
a good representation should capture and remove the redundancy of the data. This leads to a factorial representation/ encoding 
in which the components are as mutually independent of each other as possible. Barlow suggested that such representation may 
be achieved through minimum entropy encoding: an invertible transformation (i.e., with no information loss) which minimizes 
the sum of marginal entropies (as later presented in (|^). 
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Factorial representations have several advantages. The probability of the occurrence of any realization can be simply computed 
as the product of the probabilities of the individual components that represent it (assuming such decomposition exists). In 
addition, any method of finding factorial codes automatically implements Occam’s razor which prefers simpler models over 
more complex ones, where simplicity is defined as the number of parameters necessary to represent the joint distribution of 
the data. In the context of supervised learning, independent features can also make later learning easier; if the input units to a 
supervised learning networks are uncorrelated, than the Hessian of its error function is diagonal, allowing accelerated learning 
abilities |[^. There exists a large body of work which demonstrates the use of factorial codes in learning problems. This mainly 
includes Neural Networks 0,0 with application to facial recognition |j^-0 and more recently. Deep Learning pO) , El- 

Unfortunately Barlow did not propose any direct method for finding factorial codes. Atick and Redlich | [T^ proposed a 
cost function for Barlow’s principle for linear systems, which minimize the redundancy of the data subject to a minimal 
information loss constraint. This is closely related to Plumbey’s GD objective function, which minimizes the information loss 
subject to a fixed redundancy constraint. Schmidhuber 03 then proposed several ways of approximating Barlow’s minimum 
redundancy principle in the non-linear case. This naturally implies much stronger results of statistical independence. However, 
Schmidhuber’s scheme is rather complex, and appears to be subject to local minima Q. To our best knowledge, the problem 
of finding minimum entropy codes is still considered an open problem. In this work we present what appears to be the first 
efficient and constructive set of methods for minimizing Barlow’s redundancy criterion. 

In a second line of work, we may consider our contribution as a generalization of the BICA problem. In his pioneering 
BICA work, Yeredor 03 assumed linear XOR mixtures and investigated the identihability problem. A deflation algorithm is 
proposed for source separation based on entropy minimization. Yeredor assumes the number of independent sources is known 
and the mixing matrix is a K-hy-K invertible matrix. Under these constraints, he proves that the XOR model is invertible and 
there exists a unique transformation matrix to recover the independent components up to permutation ambiguity. 

Yeredor then extended his work 03 to cover the ICA problem over Galois fields of any prime number. His ideas were 
further analyzed and improved by Gutch et al. OH- 

In p8) , a noise-OR model is introduced to model dependency among observable random variables using K (known) latent 
factors. A variational inference algorithm is developed. In the noise-OR model, the probabilistic dependency between observable 
vectors and latent vectors is modeled via the noise-OR conditional distribution. 

In 03’ the observations are generated from a noise-OR generative model. The prior of the mixing matrix is modeled as the 
Indian buffet process pO) . Reversible jump Markov chain Monte Carlo and Gibbs sampler techniques are applied to determine 
the mixing matrix 

Streith et al. OD study the BICA problem where the observations are either drawn from a signal following OR mixtures or 
from a noise component. The key assumption made in that work is that the observations are conditionally independent given 
the model parameters (as opposed to the latent variables). This greatly reduces the computational complexity and makes the 
scheme amenable to a objective descent-based optimization solution. However, this assumption is in general invalid. 

In 1221, Nguyen and Zheng consider OR mixtures and propose a deterministic iterative algorithm to determine the distribution 
of the latent random variables and the mixing matrix. 

There also exists a large body of work on blind deconvolution with binary sources in the context of wireless communication 
and some literature on Boolean/binary factor analysis (BFA) which is also related to this topic 


III. Generalized Binary Independent Component Analysis 

Throughout this paper we use the following standard notation: underlines denote vector quantities, where their respective 
components are written without underlines but with index. For example, the components of the n-dimensional vector X are 
Xi, X 2 , ■ ■ ■ Xn- Random variables are denoted with capital letters while their realizations are denoted with the respective 
lower-case letters. Px (x) = P{Xi = xi,X 2 = X 2 ■ ■ ■) the probability function of X while H (X) is the entropy of X. This 
means H (X) = — (x) log Px (x) where the log function denotes a logarithm of base 2 and lim 2 ;_j.o x log {x) = 0. 

Further, we denote the binary entropy as hb{p) = —p\ogp — (1 — p) log (1 — p). 


A. Problem Formulation 

Suppose we are given a random vector X ^ Px (x) of dimension n and alphabet size A for each of its components. We are 
interested in an invertible, not necessarily linear, transformation Y_ = p(X) such that Y is of the same dimension and alphabet 
size, g : {1,..., A}" {1,..., A}". In addition we would like the components of Y_ to be as ’’statistically independent as 

possible”. 

The common ICA setup is not limited to invertible transformations (hence Y_ and X may be of different dimensions). 
However, in our work we focus on this setup as we would like Y = g{X_) to be “lossless” in the sense that we do not loose 
any information. Further motivation to this setup is discussed in 03 and throughout the Applications section below. 

Notice that an invertible transformation of a vector X, where X is over a finite alphabet of size A, is actually a one-to-one 
mapping (i.e., permutation) of its A” words. For example, if X is over a binary alphabet and is of size n, then there are 2"! 
possible permutations of its words. 
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To quantify the statistical independence among the components of the vector V we use the well-known total correlation 
measure, which was first introduced by Watanabe p6) as a multivariate generalization of the mutual information. 

n 

= (1) 

i=l 

This measure can also be viewed as the cost of coding the vector component-wise, as if its components were statistically 
independent, compared to its true entropy. Notice that the total correlation is non-negative and equals zero iff the components 
of F are mutually independent. Therefore, “as statistically independent as possible” may be quantified by minimizing C'(F). 
The total correlation measure was considered as an objective for minimal redundency representation by Barlow Q. It is also 
not new to finite field ICA problems, as demonstrated in m Moreover, we show that it is specifically adequate to our 
applications, as described in the last section. 

Since we define F to be an invertible transformation of X we have H (F) = H (X) and our minimization objective is 

n 

—>• min. (2) 

i=l 

In this section we focus on the binary case. The probability function of the vector X. is therefore defined by P{Xi ,..., X„) 
over 2" = TV possible words and our objective function is simply 


n 

hb{P{Yi = 0)) —)■ min. 

i=l 


(3) 


We notice that P{Yi = 0) is the sum of probabilities of all words whose bit equals 0. We further notice that the optimal 
transformation is not unique. For example, we can always invert the i*^ bit of all words, or shuffle the bits, to achieve the 
same minimum. 

Any approach which exploits the full statistical description of the joint probability distribution of X would require going 
over all 2” possible words at least once. Therefore, a computational load of at least 0(2") seems inevitable. Still, this is 
significantly smaller (and often realistically far more affordable) than 0(2"!), required by brute-force search over all possible 
permutations. 

Indeed, the complexity of the currently known binary ICA (and Factorial codes) algorithms falls within this range. The 
AMERICA ED algorithm, which assumes a XOR mixture, has a complexity of 0(n^ • 2"). The MEXICO algorithm, which is 
an enhanced version of AMERICA, achieves a complexity of 0(2") under some restrictive assumptions on the mixing matrix. 
In 1221 the assumption is that the data was generated over OR mixtures and the asymptotic complexity is 0(n • 2"). There 
also exist other heuristic methods which avoid an exhaustive search, such as | [77| for BICA or fT^ for Eactorial codes. These 
methods, however, do not guarantee convergence to the global optimal solution. 

Looking at the BICA framework, we notice two fundamental a-priori assumptions: 


1) The vector X is a mixture of independent components and there exists an inverse transformation which decomposes 
these components. 

2) The generative model (linear, XOR held, etc.) of the mixture function is known. 


In this work we drop these assumptions and solve the ICA problem over finite alphabets with no prior assumption on the 
vector X. As a first step towards this goal, let us drop Assumption 2 and keep Assumption 1, stating that underlying independent 
components do exist. The following combinatorial algorithm proves to solve this problem, over the binary alphabet, in 0(n -2") 
computations. 


B. Generalized BICA with Underlying Independent Components 

In this section we assume that underlying independent components exist. In other words, we assume there exists a permutation 
F = 5 (X) such that the vector F is statistically independent P(Fi,... ,F„) = nr=i Denote the marginal probability 

of the bit equals 0 as tt^ = P(li = 0). Notice that by possibly inverting bits we may assume every tt^ is at most ^ and 
by reordering we may have, without loss of generality, that 7r„ < 7r„_i < ■ ■ ■ < tti < 1/2. In addition, we assume a non¬ 
degenerate setup where 7r„ > 0. Eor simplicity of presentation, we first analyze the case where 7r„ < TTn-l < • • • < TTl < 1/2. 

This is easily generalized to the case where several may equal, as discussed later in this section. 

Denote the 2" probabilities of P(F = y) as pi,p 2 , ■ ■ • ,Pn, assumed to be ordered so that Pi < P 2 ^ ^ Pn- We 

first notice that the probability of the all-zeros word, P(F„ = 0,F„_i = 0,... ,Fi =0) = smallest possible 

probability since all parameters are not greater than 0.5. Therefore we have pi = ]/["=! 

Since tti is the largest parameter of all tt^, the second smallest probability is just P(l^ = 0, ...,F 2 = 0,Fi = 1) = 
Tin ■ TTn-l ’ ... ’ 7r2 • (1 — TTi) = p 2 - Therefore we can recover tti from leading to tti = 

We can further identify the third smallest probability as pa = 7r„ • 7r„_i ■ ... ■ ■ {1 — 112 ) ■ tti. This leads to tt 2 = —. 
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However, as we get to pi we notice we can no longer definitely identify its components; it may either equal 7 r„ • 7 r„_i • 

... • TTs • (1 — 7 r 2 ) • (1 — TTi) or 7 r„ • TTn-i • ... • (1 — TTs) • 712 ' TWs ambiguity is easily resolved since we can compute the 
value of 7 r„ • 7 r„_i • ... • 7 r 3 • (1 — ^ 2 ) ■ (1 — tti) from the parameters we already found and compare it with P 4 . Specifically, 
If 7r„ • 7r„_i • ... • TTa • (1 — 712 ) • (1 — tti) 7 ^ p 4 then we necessarily have tt^ • 7 r„_i • ... • (1 — 713 ) • 712 • tti = p 4 from which 
we can recover 7 r 3 . Otherwise 7 r„ • 7 r„_i • ... • (1 — 773 ) • 712 • tti = ps from which we can again recover 773 and proceed to the 
next parameter. 

Let us generalize this approach. Denote as a set of probabilities of all words whose (k + 1)*^,..., bits are all zero. 

Theorem 1: Let j be an arbitrary index in {1, 2,..., N}. Assume we are given that pj, the smallest probability in the 
set of probabilities P{Yn ,..., Li), satisfies the following decomposition 

Pj — '^n ‘ '^n —1 * ... * '^k +1 * (1 * '^k — 1 ‘ ' TTi. 

Further assume the values of Afe_i are all given in a sorted manner. Then the complexity of finding the value of 7 r„ • 7 r„_i • 

... • 7 rfe +2 • (1 — TTk+i) ■ TTfe • ... • TTi, and calculating and sorting the values of A^ is 0 ( 2 ^). 

Proof Since the values of Afc_i and tt^, are given we can calculate the values which are still missing to know A^ entirely 
by simply multiplying each element of Afc_i by . Denote this set of values as Kk-i- Since we assume the set Afe_i is 
sorted then Af;_i is also sorted and the size of each set is 2^“^. Therefore, the complexity of sorting A^ is the complexity of 
merging two sorted lists, which is 0 ( 2 ^). 

In order to find the value of 7 r„ • 7 r„_i •... • ■nk +2 • (1 — T^k+i) • •... • tti we need to go over all the values which are larger 

than Pj and are not in A^. However, since both the list of all probabilities {pi]^^i and the set A^ are sorted we can perform a 
binary search to hnd the smallest entry for which the lists differ. The complexity of such search is 0(log (2^)) = 0{k) which 
is smaller than 0(2^). Therefore, the overall complexity is 0(2^) ■ 

Our algorithm is based on this theorem. We initialize the values pi,P 2 and Ai, and for each step /c = 3... n we calculate 

'^n ‘ '^n —1 * • . • * TTkj-l * (1 '^k) ' —1 ' • . • ’ TTi and Kk—\- 

The complexity of our suggested algorithm is therefore X]fe=i 0{2^) = 0{n ■ 2"). However, we notice that by means of 
the well-known quicksort algorithm the complexity of our preprocessing sorting phase is 0{N\og[N)) = 0{n ■ 2”). 
Therefore, in order to hnd the optimal permutation we need 0{n ■ 2”) for sorting the given probability list and 0(2") for 
extracting the parameters of P{Xi ,..., X„). Notice that this algorithm is combinatorial in its essence and is not robust when 
dealing with real data. In other words, the performance of this algorithm strongly depends on the accuracy of P{Xi ,..., X„) 
and does not necessarily converge towards the optimal solution when applied on estimated probabilities. 

Let us now drop the assumption that the values of tt^’s are non-equal. That is, 7 r„ < 7 r„_i < • • • < tti < 1/2. It may be 
easily verihed that both Theorem and our suggested algorithm still hold, with the difference that instead of choosing the 
single smallest entry in which the probability lists differ, we may choose one of the (possibly many) smallest entries. This 
means that instead of recovering the unique value of -Kk at the iteration (as the values of tt^’s are assumed non-equal), we 
recover the smallest value in the list 7 r„ < 7 r„_i < • • • < tti < 1 / 2 . 

C. Generalized BICA via Search Tree Based Algorithm 

We now turn to the general form of our problem 0 with no further assumption on the vector X_. 

We denote ni= {all words whose bit equals 0}. In other words, H^ is the set of words that “contribute” to P{Yi = 0). 
We further denote the set of H^ that each word is a member of as F^ for all fc = 1... W words. For example, the all zeros 
word {00 ... 0 } is a member of all H^ hence Fi = {Hi,..., n„}. 

We dehne the optimal permutation as the permutation of the N words that achieves the minimum of C{Yf) such that tt^ < 1/2 
for every i. 

Let us denote the binary representation of the word with y{k). Looking at the N words of the vector y we say that a 
word y{k) is majorant to y(l) {y{k) P y{l)) if F; C F^. In other words, y{k) is majorant to y{l) iff for every bit in y{l) that 
equals zeros, the same bit equals zero in y{k). In the same manner a word y{k) is minorant to y{l) {y{k) < y{l)) if F^ C F;, 
that is iff for every bit in y{k) that equals zeros, the same bit equals zero in y{l). For example, the all zeros word {00... 0} 
is majorant to all the words, while the all ones word { 11 ... 1 } is minorant to all the word as non of its bits equals zeros. 

We say that y{k) is a largest minorant to y{l) if there is no other word that is minorant to y{l) and majorant to y{k). We 
also say that there is a partial order between y{k) and y{l) if one is majorant or minorant to the other. 

Theorem 2: The optimal solution must satisfy P{y{k)) > P{y{l)) for all y{k) < y{l)- 

Proof Assume there exists y{k) ^ y{l),k f I such that P{y{k)) < P{y{l)) which achieves the lowest (optimal) C{Y). Since 
yik) f y{l) then, by dehnition, F^ C Fj. This means there exists n_, which satisfies H^ G F; \ F^. Let us now exchange 
(swap) the words y{k) and y{l). Notice that this swapping only modifies H^ but leaves all other H^’s untouched. Therefore 
this swap leads to a lower C{Yf as the sum in 0 remains untouched apart from its summand which is lower than before. 
This contradicts the optimality assumption ■ 



IEEE TRANSACTIONS ON INFORMATION THEORY 


5 


We are now ready to present our algorithm. As a preceding step let us sort the probability vector p (of 2D such that pi < pj 
for all i < j. As described above, the all zeros word is majorant to all words and the all ones word is minorant to all words. 
Hence, the smallest probability pi and the largest probability pff are allocated to them respectively, as Theorem suggests. 
We now look at all words that are largest minorants to the all zeros word. 

Theorem guarantees that p 2 must be allocated to one of them. We shall therefore examine all of them. This leads to a 
search tree structure in which every node corresponds to an examined allocation of pi. In other words, for every allocation of 
Pi we shall further examine the allocation of to each of the largest minorants that are still not allocated. This process 
ends once all possible allocations are examined. 

The following example (Figure [T]) demonstrates our suggested algorithm with n = 3. The algorithm is initiated with the 
allocation of pi to the all zeros word. In order to illustrate the largest minorants to {000} we use the chart of the partial order 
at the bottom left of Figure As visualized in the chart, a word y{l) is minorant to y{k) if it lies within one of the shapes 
(e.g. ellipses, rectangles) that also contains y{k). Therefore, the largest minorants to {000} are {001}, {010} and {100}. As 
we choose to investigate the allocation of p 2 to {001} we notice that resulting largest minorants that are still not allocated are 
{010} and {100}. We then investigate the allocation of p^ to {010}, for example, and continue until all pi are allocated. 



Fig. 1; Search tree based algorithm with n = 3 


This search tree structure can be further improved by introducing a depth-hrst branch and bound enhancement. This means 
that before we examine a new branch in the tree we bound the minimal objective it can achieve (through allocation of the 
smallest unallocated probability to all of its unallocated words for example). 

The asymptotic computational complexity of this branch and bound search tree is quite difficult to approximate. There are, 
however, several cases where a simple solution exists (for example, for n = 2 it is easy to show that the solution is to allocate 
all four probabilities in ascending order). 

D. Generalized BICA via Piecewise Linear Relaxation Algorithm 

In this section we present a different approach which bounds the optimal solution from above as tightly as we want in 
0{n^ ■ 2"), where k dehnes how tight the bound is. Throughout this section we assume that fc is a hxed value, for complexity 
analysis purposes. 

Let us hrst notice that the problem we are dealing with ([^ is a concave minimization problem over a discrete permutation 
set which is NP hard. However, let us assume for the moment that instead of our “true” objective Q we have a simpler linear 
objective function. That is. 


n N 

L{Y_) = '^a^TTi + bi = '^CiP{Y = y{i))+d (4) 

i=l i=l 

where the last equality changes the summation over n bits to a summation over all N = 2"^ words (this change of summation 
is further discussed in Section |III-E| l. 

In order to minimize this objective over the N given probabilities p we simply sort these probabilities in a descending 
order and allocate them such that the largest probability goes with the smallest coefficient Ci and so on. Assuming both the 
coefficients and the probabilities are known and sorted in advance, the complexity of this procedure is linear in N. 

We now turn to the generalized binary ICA problem as dehned in ([^. Since our objective is concave we would hrst like 
to bound it from above with a piecewise linear function which contains k pieces, as shown in Figure In this paper we do 
not discuss the construction of such upper-bounding piecewise linear function, nor tuning methods for the value of k, and 
assume this function is given for any hxed k. Notice that the problem of approximating concave curves with piecewise linear 
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functions is very well studied (for example | |29) ) and may easily be modified to the upper bound case. We show that solving 
the piecewise linear problem approximates the solution to Q as closely as we want, in signihcantly lower complexity. 



Fig. 2; piecewise linear (k = 4) relaxation to the binary entropy 


From this point on we shall drop the previous assumption that 7r„ < 7r„_i < • • • < tti, for simplicity of presentation. First, 
we notice that all tt's are equivalent (in the sense that we can always interchange them and achieve the same result). This 
means we can hnd the optimal solution to the piecewise linear problem by going over all possible combinations of placing 
the n variables tt^ in the k different regions of the piecewise linear function. For each of these combinations we need to solve 
a linear problem (such as in Q, where the minimization is with respect to allocation of the N given probabilities p) with 
additional constraints on the ranges of each tt^. For example, assume n = 3 and the optimal solution is such that two tt's (e.g. 
TTi and 712) are at the hrst region, Ri, and 713 is at the second region, R2. Then, we need to solve the following constrained 
linear problem, 

minimize ai • (tti + 712) + 2 bi + 02 ■ tts + 62 
subject to TTi,772 G G R2 

where the minimization is over the allocation of the given {pi}^^i, which determine the corresponding tt^’s, as demonstrated 
in 0. This problem is again NP hard. However, if we attempt to solve it without the constraints we notice the following; 

1 ) If the collection of tt's which dehne the optimal solution to the unconstrained linear problem happens to meet the 
constraints then it is obviously the optimal solution with the constraints. 

2 ) If the collection of tt's of the optimal solution do not meet the constraints (say, 712 G R2) then, due to the concavity 
of the entropy function, there exists a different combination with a different constrained linear problem (again, over the 
allocation of the N given probabilities p), 

minimize oitti + 61 + 02(712 + 773) + 2&2 
subject to TTi G 772 , 773 S R2 

in which this set of 77's necessarily achieves a lower minimum (since a2X + &2 < o,iX + bi Vx G i?2)- 

Therefore, in order to hnd the optimal solution to the piecewise linear problem, all we need to do is to go over all possible 
combinations of placing the tt's in the k different regions, and for each combination solve an unconstrained linear problem 
(which is solved in a linear time in N). If the solution does not meet the constraint then it means the assumption that the 
optimal TTi reside within this combination’s regions is false. Otherwise, if the solution does meet the constraint, it is considered 
as a candidate for the global optimal solution. 

The number of combinations is equivalent to the number of ways of placing n identical balls in k boxes, which is (for a 
hxed k). 


n + k — 1 
n 


n + k — 1 
k-1 


^ {n + k — 1)^ ^ 


0(n^). 


( 6 ) 


Assuming the coefficients are all known and sorted in advance, for any hxed k the overall asympthotic complexity of our 
suggested algorithm, as n —>■ 00, is just 0(n^ • 2"). 
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E. The Relaxed Generalized BICA as a single matrix-vector multiplication 

It is important to notice that even though the asymptotic complexity of our approximation algorithm is 0(n^ • 2"), it takes 
a few seconds to run an entire experiment on a standard personal computer for as much as n = 10, for example. The reason is 
that the 2 " factor refers to the complexity of sorting a vector and multiplying two vectors, operations which are computationally 
efficient on most available software. Moreover, if we assume that the coefficients in Q are already calculated, sorted and stored 
in advance, we can place them in a matrix form A and multiply the matrix with the (sorted) vector p. The minimum of this 
product is exactly the solution to the linear approximation problem. Therefore, the practical complexity of the approximation 
algorithm drops to a single multiplication of a {n^ x 2 ") matrix with a ( 2 ” x 1 ) vector 

Let us extend the analysis of this matrix-vector multiplication approach. 

Each row of the matrix A corresponds to a single coefficient vector to be sorted and multiplied with the sorted probability 
vector p. Each of these coefficient vectors correspond to one possible way of placing n bits in k different regions of the 
piecewise linear function. Specifically, in each row, each of the n bits is assumed to reside in one of the k regions, hence 
it is assigned a slope ai as indicated in Q. Eor each row, our goal is to minimize L{Y_). Since this minimization is solved 
over the vector p we would like to change the summation accordingly. To do so, each entry of the coefficient vector (denoted 
as Ci in Q) is calculated by summing all the slopes that correspond to each tt^. Eor example, let us assume n = 3 where 
TTi, 712 S Ri, with a corresponding slope oi and intercept bi, while the 773 G R 2 with 02 and 62 - We use the following mapping; 
P(y = 000) =pi,P(y = 001) =p 2 ,...,P{Y^in) =ps. Therefore 


TTi = P(Yi = 0) = Pi + P2 + P3 + P4 

772 = PiY 2 = 0) = Pi +P 2 +P 5 +P 6 ■ 

773 = PiYs = 0) = Pi +P3+P5+P7 

The corresponding coefficients Ci are then the sum of rows of the following matrix 


This leads to a minimization problem 


Ol 

Ol 

02 

ai 

Ol 

0 

ai 

0 

02 

ai 

0 

0 

0 

Ol 

02 

0 

Ol 

0 

0 

0 

02 

0 

0 

0 


n 

L{Y.) = X! + bi = 

i=l 


ai(7ri + 712 ) + 02773 + 2bi + 62 = 


(7) 


( 8 ) 


(9) 


( 2 ai + 02)^1 + 2 aiP 2 + (oi -f 02)^3 + aiPi + (at + ^2)^5 + aiPe + a 2 P 7 + 26 i + 62 

where the coefficients of pi are simply the sum of the row in the matrix A. 

Now let us assume that n is greater than k (which is usually the case). It is easy to see that many of the coefficients Ci are 
actually identical in this case. Precisely, let us denote by k the number of assignments for the region. Then, the number 
of unique Ci coefficients is simply 

k 

+1) -1 

subject to — 77- Since we are interested in the worst case (of all rows of the matrix A), we need to find the non¬ 

identical coefficients. This is obtained when li is as ’’uniform” as possible. Therefore we can bound the number of non-identical 
coefficients from above by temporarily dropping the assumption that k’s are integers and letting k = ^ so that 

k ^ 

maxU(Zi + 1 ) < (^ -I- 1 ) =0{n^). (10) 

i=l 

This means that instead of sorting the 2" coefficients for each row of the matrix A, we only need to sort 0{n^) coefficients. 
Now, let us further assume that the data is generated from some known parametric model. In this case, some probabilities 
Pi may also be identical, so that the probability vector p may also not require 0(n • 2") operations to be sorted. Eor example. 
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if we assume a block independent structure, such that n components (bits) of the data are generated from - independent and 
identically distributed blocks of of size r, then it can be shown that the probability vector p contains at most 



( 11 ) 


non-identical elements pi. Another example is a first order stationary symmetric Markov model. In this case there only exists 
a quadratic number, n • (n — 1) + 2 = O(n^), of non-identical probabilities in p (see Appendix). 

This means that applying our relaxed Generalized BICA on such datasets may only require 0{n^) operations for the matrix 
A and a polynomial number of operations (in n) for the vector p\ hence our algorithm is reduced to run in a polynomial time 
in n. 

Notice that this derivation only considers the number of non-identical elements to be sorted through a quicksort algorithm. 
However, we also require the degree of each element (the number of times it appears) to eventually multiply the matrix A with 
the vector p. This, however, may be analytically derived through the same combinatorical considerations described above. 


F. Relaxed Generalized BICA Illustration and Experiments 

In order to validate our approximation algorithm we conduct several experiments. 

In the first experiment we randomly generate a probability distribution with n = 10 statistically independent components 
and mix its components in a non-linear fashion. We apply the approximation algorithm on this probability distribution with 
different values of k and compare the approximated minimum entropy we achieve (that is, the result of the upper-bound 
piecewise linear cost function) with the entropy of the vector. In addition, we apply the estimated parameters on the true 
objective ([^, to obtain an even closer approximation. Figure demonstrates the results we achieve, showing the convergence 
of the approximated entropy towards the real entropy as the number of linear pieces increases. Moreover, as we repeat this 
experiment several times (that is, randomly generate probability distributions and examine our approach for every single value 
of k), we see that the estimated parameters are equal to the independent parameters for k as small as 4, on average. 



Fig. 3: Piecewise linear approximation (solid line), entropy according to the estimated parameters (dashed-dot line) and the 
real entropy (dashed horizontal line), for a vector size n = 10 and different k linear pieces 


We further illustrate the use of the BICA tool by the following example on ASCII code. The ASCII code is a common 
standardized eight bit representation of western letters, numbers and symbols. We gather statistics on the frequency of each 
character, based on approximately 183 million words that appeared in the New York Times magazine p0| . We then apply 
the BICA (with k = 8, which is empirically sufficient) on this estimated probability distribution, to find a new eight bit 
representation of characters, such that the bits are ”as statistically independent” as possible. We find that the entropy of the 
joint probability distribution is 4.8289 bits, the sum of marginal entropies using ASCII representation is 5.5284 bits and the sum 
of marginal entropies after applying BICA is just 4.8532 bits. This means that there exists a different eight bit representation 
of characters which allows nearly full statistical independence of bits. Moreover, through this representation one can encode 
each of the eight bit separately without losing more than 0.025 bits, compared to encoding the eight bits altogether. 

IV. Generalized Independent Component Analysis Over Finite Alphabets 
A. Piecewise Linear Relaxation Algorithm - Exhaustive Search 

Let us extend the notation of the previous sections, denoting the number of components as n and the alphabet size as q. 
We would like to minimize H{Yi) where Yi is over an alphabet size q. We first notice that we need q — 1 parameters 
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to describe the multinomial distribution of Yi such that all of the parameters are not greater than \. Therefore, we can bound 
from above the marginal entropy with a piecewise linear function in the range [0, |], for each of the parameters of Yi. We 
call a (g — l)-tuple of regions a cell. As in previous sections we consider k, the number of linear pieces, to be hxed. Notice 
however, that as q and n increase, k needs also to take greater values in order to maintain the same level of accuracy. As 
mentioned above, in this work we do not discuss methods to determine the value of k for given q and n, and empirically 
evaluate it. 


Let us denote the number of cells to be visited in our approximation algorithm (Section IIl-D i as C. Since each parameter 
is approximated by k linear pieces and there are q — 1 parameters, C equals at most k‘^~^. 

In this case too, the parameters are exchangeable (in the sense that the entropy of a multinomial random variable with 
parameters {pi,P 2 ,P 3 } is equal to the entropy of a multinomial random variable with parameters {p 2 ,Pi,P 3 }, for example). 
Therefore, we do not need to visit all k‘^~^ cells, but only a unique subset which disregards permutation of parameters. In 
other words, the number of cells to be visited is bounded from above by the number of ways of choosing q—1 elements (the 
parameters) out of k elements (the number of pieces in each parameter) with repetition and without order. Notice this upper- 
bound (as opposed to full equality) is a result of not every combination being a feasible solution, as the sum of parameters 
may exceed 1. Assuming k is hxed and as q —>■ oo this equals 


q — 1 + k — 1 

9-1 


q - 1 + k - 1 
k-1 


< 


{q-l + k-1) 


k-1 


= 0{q’^). 


( 12 ) 


(fc-1)! 

Therefore, the number of cells we are to visit is simply C = min 0((3'^)). For sufficiently large q it follows that 

C = O ( 9 ^). As in the binary case we would like to examine all combinations of n entropy values in C cells. The number of 
iterations to calculate all possibilities is equal to the number of ways of placing n identical balls in C boxes, which is 


n + C-I ^ 
n 


(13) 


In addition, in each iteration we need to solve a linear problem which takes a linear complexity in g". Therefore, the overall 
complexity of our suggested algorithm is O {n^ ■ g"). 


We notice however that for a simple case where only two components are mixed (n = 2), we can calculate (13 1 explicitly 

C(C + 1) 


2 + C- 1 
2 


Putting this together with (12 1 , leads to an overall complexity which is polynomial in g, for a hxed k. 


g'=(g'= + l) 
2 


g2^ = O (g2fe+2) . 


(14) 


(15) 


Either way, the computational complexity of our suggested algorithm may result in an excessive runtime, to a point of in¬ 
feasibility, in the case of too many components or an alphabet size which is too large. This necessitates a heuristic improvement 
to reduce the runtime of our approach. 


B. Piecewise Linear Relaxation Algorithm - Objective Descent Search 

In Section [riI-D| we present the basic step of our suggested piecewise linear relaxation to the generalized binary ICA problem. 
As stated there, for each combination of placing n components in k pieces (of the piecewise linear approximation function) we 
solve a linear problem (LP). Then, if the solution happens to meet the constraints (falls within the ranges we assume) we keep 
it. Otherwise, due to the concavity of the entropy function, there exists a different combination with a different constrained 
linear problem in which this solution that we found necessarily achieves a lower minimum, so we disregard it. 

This leads to the following objective descent search method; instead of searching over all possible combinations we shall 
hrst guess an initial combination as a starting point (say, all components reside in a single cell). We then solve its unconstrained 
LP. If the solution meets the constraint we terminate. Otherwise we visit the cell that meets the constraints of the solution we 
found. We then solve the unconstrained LP of that cell and so on. We repeat this process for multiple random initialization. 

This suggested algorithm is obviously heuristic, which does not guarantee to provide the global optimal solution. Its 
performance strongly depends on the number of random initializations and the concavity of the searched domain. 

The following empirical evaluation demonstrates our suggested approach. In this experiment we randomly generate a 
probability distribution with n independent and identically distributed components over an alphabet size g. We then mix 
its components in a non-linear fashion. We apply the objective descent algorithm with a fixed number of initialization points 
(/ = 1000) and compare the approximated minimum sum of the marginal entropies with the true entropy of the vector. Figure 
1^ demonstrates the results we achieve for different values of n. We see that the objective descent algorithm approximates the 
correct components well for smaller values of n but as n increases the difference between the approximated minimum and the 
optimal minimum increases, as the problem becomes too overwhelming. 
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Algorithm 1 Relaxed Generalized ICA For Finite Alphabets via Gradient Search 

Require: p = the probability function of the random vector X 

Require: n = the number of components of X 

Require: C = the number of cells which upper-bound the objective. 

Require: I = the number of initializations. 

1 : opt ^ oo, where the variable opt is the minimum sum of marginal entropies we are looking for. 

2: y •(— 0, where V is the current cells the algorithm is visiting. 

3: S' •(— oo, where S is the solution of the current LR 
4: i ^ 1. 

5 : while i < / do 

6 : V randomly select an initial combination of placing n components in C cells 

7: S LP(y) solve an unconstrained linear prograsm which corresponds to the selected combination, as appears in Q. 

8 : if the solution falls within the bounds of the cell then 

9: if H{S) < opt then 

10 : opt ^ H{S), the sum of marginal entropies which correspond to the parameters found by the LP 

11: end if 

12 : i ^ i -f 1 

13: else 

14: V 4— the cells in which S reside. 

15: goto 7 

16: end if 

17: end while 
18: return opt 



Fig. 4; The real entropy (solid line) and the sum of marginal entropies as discovered by the objective descent algorithm, for 
an i.i.d vector over an alphabet size g = 4 and of varying number of components n 


V. Constrained Generalized Independent Component Analysis Over Finite Alphabets 

The outcome of the Generalized ICA, regardless of its computational cost, is a mapping from all g" words of X to words 
in g ■ {Ij • ■ ■ j ?}" { 1 ; • ■ ■ j ?}"> where q is the alphabet size of each component and n is the number of components. 

This mapping alone may be too large to store in a database, not to mention retrieve and use. Therefore, we may be interested 
in a sub-optimal mapping that is more compact. One way of defining such a mapping is through imposing a constraint on the 
maximal number of components Xi that every Yi depends on. This enables a simpler, function-like mapping that is easier and 
cheaper to implement on both hardware or software. 

In this section we turn our attention back to the binary case, q = 2. Recall that a balanced boolean function [ |3T| is a binary 
function whose output yields as many Os as Is over its input set. This means that for a uniformly random input string of bits, 
the probability of getting a 1 is 

We notice that each component of X must hold as many Os as Is, as the transformation is invertible (every possible word 
in X is mapped to a different word in X). This means that a necessary condition for X to be invertible is that each component 
Yi is a balanced boolean function (of as many as r components) of X. 
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A. Functions of Two Components (r = 2) 

Let us first focus on a simple case where r = 2. It is easy to verify that in this case, every balanced function is linear over 
GF{2). Notice that this no longer applies for r > 2, as the set of linear functions is just a subset of all balanced functions. 
Therefore, for r = 2, the solution is much easier to attain. For example, if we assume a structure where Yi = (mod n)) 

then there exist at most 2"+^ — 2 different invertible transformations to search, in order to find the optimal solution (see 
Appendix). Since the size of this search space is exponential in n, we may want to apply a heuristic based on Attux et al. 
to find a sub-optimal solution to this problem in fewer operations. 

In their work, Attux et al. suggest an immune-inspired algorithm for minimizing Q over XOR field (linear transformations). 
Their algorithm starts with a random ’’population” where each element in the population represents a valid transformation 
(W, an invertible matrix). At each step, the affinity function evaluates the objective (l — element 

in the population, which is subsequently cloned. Then, the clones suffer a mutation process that is inversely proportional to 
their affinity, generating a new set of individuals. This new set is evaluated again in order to select the individual with highest 
affinity, for each group of clone individuals and their parent individual. The process is finished with a random generation of d 
new individuals to replace the lowest affinity individuals in the population. The entire process is repeated until a pre-configured 
number of repetitions is executed. Then, the solution with the highest affinity is returned. It is important to notice that the 
mutation phase is implemented with a random bit resetting routine, with the constraint of accepting only new individuals that 
form a nonsingular matrix (invertible transformation). 

Therefore, we can simply adjust Attux et al.’s algorithm to satisfy our constraints by forcing each row in W to hold at most 
r elements equal to 1. This change shall be embedded in the mutation phase, right before we verify that W is non-singular. 

We notice that we can also apply this heuristic for cases where r > 2. However, the solution we get is limited to linear 
transformations, which is a substantially smaller domain than all invertible transformations. 

B. Functions of Multiple Components (r > 2) 

In the case where Yi is a function of r > 2 components we can no longer restrict ourselves to linear transformations in 
order to find the optimal transformation. We notice two major restrictions that define our search space; 

1) The transformation needs to be invertible. 

2) Yi is a function of no more than r components of X. 

We denote all the transformations in this search space as “feasible” solutions. The two constraints mentioned dramatically 
reduce the number of feasible solutions to our optimization problem, yet they define a highly non-trivial search domain; given 
a feasible solution it is not clear how to find a different feasible solution. This makes the problem much more difficult to 
tackle and necessitates an efficient methodology for visiting all feasible solutions. 

There are two immediate ways to search through all feasible solutions. The first is going over all transformations that hold 
the first constraint (invertible transformations) and for each examining if it holds the second constraint. The second option is 
to go over all transformations for which the second constraint holds and keep only those that are also invertible. Since there 
exist 2" invertible transformations, the first option is infeasible. The following Branch-and-Feasible algorithm is based on the 
second option; 

Let us first choose one of balanced boolean functions for Yi, such that Yi is a function of no more than r components 
of X. Then, we shall again choose a balanced boolean function (of no more than r components) for Y 2 . However, not all 
functions are permitted as some of them may result in a non invertible transformation already at this step. This way, we grow 
branches of feasible balanced boolean functions. We notice that as we get deeper into the tree there exist fewer permitted splits 
(as it is harder to find functions that result in invertible transformations). Moreover, some choices of functions may result in 
branches that already exist in the tree. These branches shall obviously be pruned. A detailed discussion regarding the properties 
of this algorithm and its practical abilities are subject to future research. 

VI. Applications 

A. Blind Source Separation 

We start this section with a classical ICA application, in Blind Source Separation (BSS). Assume there exist n independent 
(or practically ’’almost independent”) sources where each source is over an alphabet size q. These sources are mixed in an 
invertible, yet unknown manner. Our goal is to recover the sources from the mixture. 

For example, consider a case with n = 2 sources Xi,X 2 , where each source is over an alphabet size q. The sources are 
linearly mixed (over a finite field) such that Yi = Xi,Y 2 = Xi + X 2 . 

However, due to a malfunction, the symbols of Y2 are randomly shuffled, before it is handed to the receiver. Notice this 
mixture (including the malfunction) is unknown to the receiver, who receives Yi, Y 2 and strives to “blindly” recover Xi,X 2 . 
In this case any linearly based method such as GD or | |27) would fail to recover the sources as the mixture, along with the 
malfunction, is now a non-linear invertible transformation. Our method on the other hand, is designed especially for such cases, 
where no assumption is made on the mixture (other than being invertible). 
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To demonstrate this example we introduce two independent sources Xi,X 2 , over an alphabet size q. We apply the linear 
mixture Yi = Xi,Y 2 — Xi + X 2 and shuffle the symbols of ¥ 2 - We are then ready to apply (and compare) our suggested 
methods for finite alphabet sizes, which are the exhaustive search method (Section IV-A[ ) and the objective descent method 
(Section |1V-B| ). 

For the purpose of this experiment we assume both Xi and X 2 are distributed according to a Zipf’s law distribution, 


Pik;s,q) = 


k- 




with a parameter s = 1.6. The Zipf’s law distribution is a commonly used heavy-tailed distribution. This choice of distribution 
is further motivated in Section IVI-DI 

We apply our suggested algorithms for different alphabet sizes, with a fixed fc = 8, and with only 100 random initializations 
for the objective descent method. Figure [^presents the results we achieve. 




Fig. 5; BSS simulation results. Left; the lower curve is the joint entropy of FI, y2, the asterisks curve is the sum of marginal 
entropies using the exhaustive search method (Section IV-A| i while the curve with the squares corresponds the objective descent 
method (Section [IV-B| |. Right: the curve with the asterisks corresponds to the difference between the exhaustive search method 
and the joint entropy while the curve with the squares is the difference between the objective descent search method and the 
joint entropy. 


We first notice that both methods are capable of finding a transformation for which the sum of marginal entropies is very 
close to the joint entropy. This means our suggested methods succeed in separating the non-linear mixture Yi,Y 2 back to the 
statistically independent sources Xi,X 2 , as we expected. 

Looking at the chart on the right hand side of Figure we notice that the difference between the two methods tends 
to increase as the alphabet size q grows. This is not surprising since the search space grows while the number of random 
initializations remains fixed. However, the difference between the two methods is still practically negligible, as we can see 
from the chart on the left. This is especially important since the objective descent method takes significantly less time to apply 
as the alphabet size q grows. 


B. Source Coding - Generalization of Predictive Coding 

In his pioneering work from 1955, Peter Elias p2| presented a sequential scheme for source coding. He showed that instead 
of coding a process directly, the encoder should first predict the current value of the process from its past realizations and then 
code the prediction error. The prediction errors are encoded component-wise, as if they are statistically independent. This way 
the encoder strives to eliminate any statistical dependency and only codes the information that is still not known. 

We generalize this procedure by saying that instead of sequential generation of statistically independent components (predic¬ 
tion errors), an improved encoder shall assemble a vector of observations and make it ”as statistically independent as possible” 
all together. This can be shown to achieve improved performance (closer to the entropy of the vector) at the expense of working 
in batch. Though our suggested finite group ICA algorithms are of high complexity, it is important to notice that they are to 
be applied once, off-line, in order to generate the optimal mapping g : {0,1}" —{0,1}". 

Compared with Huffman coding, the main advantage of the predictive coding scheme is that unlike sequential Huffman 
coding, it does not require an exponentially increasing codebook and utilizes a fixed codebook for each component it codes. 
It does however require an exponentially increasing prediction book so that each realization of past observations is mapped to 
a single prediction of the present. This problem may be evaded by limiting the prediction module to a parametric structure, as 
discussed in Section IV] 
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C. Distributed Source Coding 

Let be a sequence of independent and identically distributed drawings of a pair of correlated discrete 

random variables Xi and X 2 . For lossless compression with Xi = Xi and X 2 = X 2 after decompression, we know from 
Shannon’s source coding theory that a rate given by the joint entropy H{Xi,X 2 ) of Xi,X 2 is sufficient if we are encoding 
them together, as in Figure |^a. For example, we can first compress Xi into H{Xi) bits and based on the complete knowledge 
of Xi at both the encoder and the decoder, we then compress X 2 into H{X 2 \Xi) bits. But what if Xi and X 2 must be 
separately encoded for some user to reconstruct them together? 

One simple way is to do separate coding with rate R = H{Xi) + H{X 2 ), which is greater than H{Xi,X 2 ) when Xi and 
X 2 are correlated. In a landmark paper Slepian and Wolf showed that R = H{Xi,X 2 ) is sufficient even for separate 
encoding of correlated sources (Figure |^b). The Slepian-Wolf theorem says that the achievable region of Distributed Source 
Coding (DSC) for discrete sources Xi and X 2 is given by i?i > H{Xi\X 2 ), R 2 > H{X 2 \Xi) and Ri + R 2 > H{Xi, X 2 ), 
which is shown in Figure The proof of the Slepian-Wolf theorem is based on random binning. Binning is a key concept in 
DSC and refers to partitioning the space of all possible outcomes of a random source into disjoint subsets or bins. However, 
the Slepian-Wolf approach suffers from several drawbacks. First, just as in Shannon’s channel coding theorem, the random 
binning argument used in the proof of the Slepian-Wolf theorem is asymptotic and non-constructive; In general, Slepian-Wolf 
codes are not easy to implement in practice. Second, as a result of non-random binning applied in most known practical codes, 
a complete zero-error regime is not applicable and some error is evident. Third, the Slepian-Wolf theorem is not easily scalable 
to a greater number of sources and non-binary alphabets. 

Let us now consider a case in which we are interested in a highly scalable distributed source coding system, in which the 
encoders are cheap, in the sense that they consist of a single predefined codebook and cannot be reconfigured post deployment. 
Such a setup applies, for example, where an extremely large number of sources need to be encoded in a distributed manner by 
multiple encoders, in order to reduce the computational cost required by each encoder. In addition, we would like the system 
to be lossless in the strict sense, such that the decompressed random variables are equal to the original sources for any block 
length. We notice that if the sources are statistically independent of each other then the rate of each encoder is simply given 
by the marginal entropy of that source, and the overall rate achieves Shannon’s joint entropy lower bound. Inspired by this, 
we ask ourselves whether there exists a pre-processing component which makes the sources as statistically independent as 
possible, prior to the distributed compression. As Figure [^c demonstrates, we would like to find an invertible transformation 
Y = g{x) such that Y_ is of the same dimension (and the same alphabet size). In addition we would like the components of 
Y_ to be as statistically independent as possible in the sense that '^H{Yi) min, where H{Yi) corresponds to the minimal 
rate of the encoder and 'Y^H{Yi) is bounded by H{JQ, the joint entropy of all sources. This is exactly achieved by our 
suggested generalized ICA over finite group algorithms, where both the transmitter and the receiver are assumed to know Px, 
and are therefore capable of finding the corresponding mapping (and its inverse). 



(b) 



(c) 


Fig. 6: (a) Joint encoding of Xi,X 2 . The encoders collaborate and a rate H{Xi,X 2 ) is sufficient, (b) Distributed encoding 
of Xi,X 2 . The encoders do not collaborate. The Slepian-Wolf theorem says that a rate H{Xi,X 2 ) is also sufficient provided 
that the decoding of Xi and X 2 is done jointly, (c) Distributed encoding with a pre-processing stage which enables a scalable, 
zero-error system with cheap fixed codebook encoders working in a rate close to H{Xi, X 2 ). 
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However, the pre-processing stage is a somewhat centralized component. That is, one may claim that by sensing all sources X, 
the centralized component can communicate to each encoder the realizations of the other sources. This shall enable conditional 
coding as in the centralized source coding scheme described above. This kind of regime, however, entails expensive and 
sophisticated encoders which are able to communicate and store multiple codebooks (exponential in the number of sources). 



Still, assume we are interested in eliminating, or at least reducing, centralized components. This means we would like each 
component T) to be dependent on as few components of X as possible. Obviously we cannot simply apply our generalized 
ICA algorithms on X as it may result with sophisticated dependencies between T) and the components of X. However, we 
may restrict these dependencies by applying the constrained version of our method, discussed in Section |V] For example, 
we may dehne a spatial pattern in which each component Yi depends on no more than r adjacent components of X, such 
that Yi = fi{xi,... ,Xi^r (mod n))- This pattern is analyzed for r = 2 in Section 
discussed in that section. 

This method of restricting the number of components each encoder is responsible for, followed by generalized ICA mapping, 
can be viewed as a trade-off between how centralized/decoupled the encoder is and the achievable rate which can be achieved 
in a lossless source coding regime for any block length. 


V-A 


and can be extended to r > 2 as later 


D. Source Coding of Large Alphabet Data 

Let be a sequence of N i.i.d. vector realizations we are to encode. Assume the dimension of the vector x is n and 

each of its components take values over an alphabet size q. Therefore, the alphabet size of the vectors x{i) is q”. 

A compressed representation of a dataset involves two components - the compressed data itself and an overhead redundancy 
(for example, in the form of a dictionary for decompression purpose). Encoding the data requires at least N times the data’s 
empirical entropy, H{X). This is attained through entropy coding according to the data’s empirical distribution. The redundancy, 
on the other hand, may be quantihed in several ways. One common way of measuring the coding redundancy is through the 
minimax criterion p4) . Szpankowski and Weinberger p5] derived the minimax point-wise redundancy for various ranges of 
sequence length and alphabet size. Throughout this section we use their results to quantify the encoding redundancy. 

Large alphabet source coding considers the case where the alphabet size g" is of order larger than the number of realizations 
X ^ g". Unlike the case where N ^ g", here the coding redundancy may not be negligible. Precisely, the minimax point-wise 
redundancy behaves asymptotically as 

dN,q^-Nlog^. (16) 

We argue that in some setups it is better to split the components of the data into blocks, with b components in each block, 
and encode the blocks separately. Notice that we may set the value of b so that the blocks are no longer considered as over 
a large alphabet size {N ^ g^). This way, the redundancy of encoding each block separately is again negligible, at the cost 
of longer averaged codeword length. For simplicity of notation we dehne the number of blocks as B, and assume B — ^ is 
a natural number. Therefore, according to encoding the n components all together takes a total of about 

N ■ H(X) + Nlog^ 


(17) 
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bits, while the block-wise compression takes about 


N 


B 

■E- 

i=i 




- 1 


1 ^ 
log-j 


(18) 


bits, where the first term is N times the sum of B block entropies and the second term is B times the redundancy of each 
block when {N ^ q^). 

Two subsequent questions arise from this setup: 

1) What is the optimal value of b that minimizes (18 1 ? 

2) Given a fixed value of b, how can we re-arrange n components into B blocks so that the averaged codeword length 
(lower-bounded by the empirical entropy), together with the redundancy, is as small as possible? 

Let us start by fixing b and focusing on the second question. 

A naive approach is to exhaustively or randomly search for all possible combinations of clustering n components into B 
blocks. Assuming n is quite large, an exhaustive search is practically infeasible; hence a different method is required. We 
suggest applying our generalized ICA tool as an upper-bound search method for efficiently searching for a minimal possible 
averaged codeword length. 

As in previous sections we define Y = g{2C), where g is some invertible transformation of X. Every block of the vector Y_ 
satisfies 


b 

E 




(19) 


where Y^^'^ refers to the block of size b. This means that the sum of marginal empirical entropies of each block bounds 
the empirical entropy of the block from above. Summing over all B blocks in ([T^ we have 


B b n B 

H{Y_^^'>) (20) 

j — 1 i—1 i—1 j—^ 

which indicates that the sum of the block empirical entropies is upper bounded by the marginal empirical entropies of the 
components of Y_ (with equality iff the components are independently distributed). 

Our suggested scheme works as follows: We first randomly partition the n components into B blocks. We estimate the joint 
probability of each block and apply the generalized ICA on it. The sum of marginal empirical entropies (of each block) is an 
upper bound on the empirical entropy of each block, as indicated in ( [T^ . Now, let us randomly permute the n components of 
the vector Y. Here by ’’permute” we refer to an exchange of positions of F’s components, as opposed to permuting the words, 
like we did in previous sections. By doing so, the sum of marginal empirical entropies of the entire vector J27=i^iYi) is 
maintained. However, as we now apply the generalized ICA on each of the (new) blocks, we necessarily minimize (or at least 
do not increase) the sum of marginal empirical entropies of the (new) block, hence minimize the sum of marginal empirical 
entropies of the entire vector Y. This means that we minimize the upper bound of the sum of block empirical entropies, as 
( |20| i suggests. In other words, we show that in each iteration we decrease (at least do not increase) an upper bound on our 
objective. We terminate once a maximal number of iterations is reached or we can no longer decrease the sum of marginal 
empirical entropies. 

Algorithm 2 summarizes our suggested scheme. 

Therefore, encoding the data if we terminate at iteration Iq takes a total of about 




a -1 N . 

B —-— log -r -b loB-bq^ + Jon log n 
2 


( 21 ) 


where the first term refers to the sum of block empirical entropies at the Jq iteration, the third term refers to the representation 
of Iq - B invertible transformation of each block during the process until Iq, and the fourth term refers to the bit permutations 
at the beginning of each iteration. 

Hence, in order to minimize (21 1 we need to find the optimal trade-off between a low value of and a low 

iteration number Iq. We may then apply this technique with different values of b to find the best compression scheme over all 
block sizes. 

In order to demonstrate our suggested method we generate a dataset according to Zipf distribution. The Zipf distribution is 
a discrete distribution commonly used for modeling of natural (real-world) quantities. It is widely used in physical and social 
sciences, linguistics, economics and many other fields. We draw N = 10® realizations from the Zipf distribution with a parameter 


1.4 and represent each realization through a 24 bit representation. For this dataset we attain an empirical entropy of 5.55 bits. 
Therefore, compressing the drawn realizations in its given 2^^ alphabet size takes a total of 10® x 5.55-1-10® x log = 9.62-10® 
bits, according to 
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Algorithm 2 Large Alphabet Source Coding via Generalized ICA over Finite Alphabets 


Require: A= 
Require: n - 
Require: B= 
Require: I = 

1 
2 

3 

4 

5 

6 

7 

8 
9 

10 
11 
12 

13 

14 

15 


a vector of realizations N. 

■■ the dimension of A. 
the number of blocks, 
the number of iterations. 

Ffm(l) ^ H(Yi) , where Hm{i) is the sum of marginal empirical entropies of the vector A at iteration 

eps ^ 10“^° 

i <— 2, where i be an iterations counter, 
while i < I and Hm{i) < Hm{i — 1) do 

Randomly partition n components into B blocks 
for j = 1 to i? do 

[i) ^ the empirical entropy of the block at iteration i. 

Apply Generalized ICA over Finite Alphabets on the block 

(i) the sum of marginal empirical entropies of the block at iteration i. 

end for 

(*)■ 


Hb{i) ^ Ej=i where Hb{i) is the sum of block empirical entropies of the vector A at iteration i. 

7 ^ i + 1 

end while ^ 
return 


Let us now apply a block-wise compression. We first demonstrate the behavior of our suggested approach with four blocks 
(B = 4) in Figure compared with a naive search which randomly searches for all possible shufflings of n components into 
B blocks, (as described above). 



Fig. 8: Large Alphabet Source Coding via Generalized ICA over Finite Alphabets. The horizontal line indicated the empirical 
entropy of A. The upper curve is the sum of marginal empirical entropies which upper bounds the lower curve, the sum of 
block empirical entropies (the outcome of our suggested approach at each iteration). The fluctuating line is the sum of block 
empirical entropies as returned by a naive random search 


We first notice that the naive search is able to find a minimum of E^=i = 6 bits. This leads to a total of about 

6 • 10® bits for the entire dataset, which is significantly lower than the 9.62 • 10® bits we achieve through a single block 
compression. It is important to mention that this minimum is found while randomly searching over a much larger number of 
iterations than indicated in Figure ([^. 

Applying our suggested scheme, we minimize (21 1 over /q = 50 and E^i = 5.85 to achieve a total of 5.93 • 10® 

bits for the entire dataset, saving about 7 • 10^ bits compared to the naive search. Table |i] summarizes the results we achieve 
for different block sizes B, compared with a naive search. 
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TABLE I; Comparison Between Block-Wise Compression via a Naive Search and a Generalized ICA Method. ”Our approach” 
refers to a Large Alphabet Source Coding via Generalized ICA over Finite Alphabets while the naive search refers to a random 
search result, when clustering n components into B blocks (that is, the minimum over 1000 random trails) 


Number of 

Number of Components 

Naive Search Minimum of 

Naive Search 

Our Approach 

Our Approach 

Our Approach 

Blocks 

in each Block 


Total Compression 

Optimal lo 


Total Compression 

6 

4 

6.1 

6.1 ■ 10® 

70 

6 

6.03 ■ 10 ® 

4 

6 

6 

6 ■ 10® 

50 

5.85 

5.93 ■ 10 ® 

3 

8 

5.9 

5.9 ■ 10® 

10 

5.8 

5.86 ■ 10 ® 


VII. Conclusion 

In this work we consider a generalized ICA over finite alphabets framework where we drop the common assumptions on 
the underlying model. Specifically, we attempt to decompose a given ’’categorical” vector to its ”as statistically independent 
as possible” components with no further assumptions, as introduced by Barlow 0. 

We first focus on the binary case and propose three algorithms to address this class of problems. In the first algorithm we 
assume that there exists a set of independent components that were mixed to generate the observed vector. We show that these 
independent components are recovered in a combinatorial manner in 0{n ■ 2") operations. The second algorithm drops this 
assumption and accurately solves the generalized BICA problem through a branch and bound search tree structure. Then, we 
propose a third algorithm which bounds our objective from above as tightly as we want to find an approximated solution in 
0{n^ ■ 2”) with k being the approximation accuracy parameter. We further show that this algorithm can be formulated as a 
single matrix-vector multiplications and under some generative model assumption the complexity is dropped to be polynomial 
in n. 

Following that we extend our methods to deal with a larger alphabet size. This case necessitates a heuristic approach to deal 
with the super exponentially increasing complexity. An objective descent search method is presented for that purpose. 

In addition we show that in some applications we may get a reduced complexity using a constrained version of the generalized 
ICA framework. This version incorporates additional restrictions on the number of components r in X that each element of Y 
is dependent on. This guarantees that the outcome of the generalized ICA algorithm is not an exponentially increasing mapping 
(in n) but a simpler and more compact structure. We show this problem can be solved through known heuristics when r = 2 
but requires a branch and feasible approach for larger values of r. 

The code for our suggested methods in publically available for future research. 

We conclude the paper by presenting four applications, three of which are in the field of source coding. The first source 
coding application extends the predictive coding scheme. We show that our generalized ICA method may apply to this setup to 
enhance its performance. This means that instead of sequentially generating statistically independent components (prediction 
errors) we generate a (more) statistically independent vector at the cost of introducing a time delay. Second, we show that our 
method applies for distributed source coding. We claim that by making the source ”as statistically independent as possible” prior 
to encoding each component separately, we achieve lossless compression for any block size while minimizing the minimal rate 
we may work at. We conclude by discussing the problem of encoding data over a large alphabet size. Our suggested scheme, 
based on the generalized ICA method, shows to efficiently find a transformation that achieves a low entropy for a block based 
entropy encoder. 

Although we focus our interest on source coding, the generalized ICA over finite alphabets method is a universal tool which 
aims to decompose data into its fundamental components. We believe that the theoretical properties we introduce, together 
with the practical algorithms which utilize a variety of setups, make this tool applicable to many disciplines. 

Appendix 

Theorem 3: Assume a binary random vector X. & {0,1}" is generated from a first order stationary symmetric Markov model. 
Then, the joint probability of X, Px = Pi, ■ ■ ■ ,Pn only contains n ■ (n — 1) + 2 unique (non-identical) elements of pi,... ,pn- 

Proof We first notice that for a binary, symmetric and stationary Markov model, the probability of each word is solely 
determined by 

1) The value of the first (most significant) bit 

2) The number of elements equal 1 (or equivalently 0) 

3) The number of transitions from 1 to 0 (and vice versa). 

* https://sites.google.com/site/amichaipainsky 
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For example, for n = 4 the probability of 0100 equals the probability of 0010, while it is not equal to the probability of 
0001 . 

Assume the number of transitions, denoted in r, is even. Further, assume that the hrst (most signihcant) bit equals zero. 
Then, the number of words with a unique probability is 


n—2 ^ 2 n—2 

Ui= H ( 22 ) 

r—2, r is even k—^ r—2, r is even 

where the summation over r corresponds to the number of transitions, while the summation over k is with respect to the 
number of 1 elements given r. For example, for n = 4 and r = 2 we have the words 0100,0010 for fc = 1 (which have 
the same probability as discussed above), and the word 0110 for fc = 2. In the same manner, assuming again that the most 
signihcant bit is 0 but now r is odd, we have 


C/2 


t'4- 1 

n-1 n-1 

r—2, r is odd r+i r—2, r is odd 

’ ft— 2 


(23) 


Putting together (221 and (231 we have that number of words with a unique probability, assuming the most signihcant bit 
is 0, equals 


+ + (24) 

r—1 

The same derivation holds for the case where the most signihcant bit is 1, leading to a total of n • (n — 1) + 2 words with 
a unique probability ■ 

Theorem 4 : There are 2"+^ — 2 invertible n x n matrices over the binary held which satisfy 

{Aij = 0 Vi, j s.t. j ^ i , j ^ i + I (mod n)} . 

Proof Let us hrst denote the diagonal of a matrix A, ^ as di while 4,2 refers to the secondary diagonal above it, i+i. We 
denote the case where a diagonal dk is all ones as ds, = 1. The only entries of A which can take non-zero values are the entries 
on di,d2 and the bottom left entry, A„_i. A matrix is invertible iff its determinant is non-zero. Laplace’s formula expresses 
the determinant of a matrix in terms of its minors. The minor Mij is dehned to be the determinant of the (n — 1) x (n — 1) 
matrix that results from A by removing the row and the column. The determinant of A is then given by 

n 

det{A) = Y, (25) 

for any column 1 < j < n. As we limit ourselves to the binary held the sum operator is simply a XOR between two binary 
variables. Let us derive the conditions for det(A) ^ 0. First, assume A„,i = 0. In this case A is a triangular matrix and 
det(A) = nr=i Therefore, the determinant of A is not equal to zero iff the diagonal di is all ones. Hence, we have 2”“^ 
values for d2 such that det(A) 7 / 0 in this case. 

In the case where A„_i = 1 we have det(A) = Ai^iMi^i + Mn,i- This means det(A) 7 / 0 iff either Ai^iMi 1 = 1 or 
]^n,i — but not both of them together. We already found that Ai iMi 1 = 1 iff di is all ones and in the same way 1 = 1 
iff d2 is all ones. Therefore det(A) 7 / 0 iff either di = 1 or ^2 = 1- This means we have (2”“^ — l) -f (2" — 1) different 
matrices which are invertible in this case. 

Putting this all together we have 2"“^ -|- (2" — 1) + (2”“^ — = 2”+^ — 2 different matrices which satisfy the conditions 

of A and are invertible ■ 
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